25  Predictor variables

Learning Objectives

After completing this tutorial you should be able to

Download the directory for this project here, make sure the directory is unzipped, and move it to your bi328 directory. You can open the Rproj file for this module either by double-clicking it (which will launch RStudio), by opening RStudio and using File > Open Project, or by clicking the R Project icon in the top right of your program window and selecting Open Project.

There should be a file named 25_predictors.qmd in that project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) Quarto file as your homework assignment. So, first thing, change the author in the YAML header to your name. You will use this Quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set1, and you may of course add as many comments as you need to be able to recall what you did. Similarly, take notes in the document as we discuss the discussion/reflection questions, but make sure that you go back and clean them up for “public consumption”.

  • 1 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual. Especially when the code is already included for you, add comments to describe how the function works/what it does as we introduce it during the participatory coding session so you can refer back to it.

  • You may need to install a few additional packages before we get started. Check the code chunk below and install any packages you still need.

    # load libraries ----
    
    # reporting
    library(knitr)
    
    # visualization
    library(ggplot2)
    library(ggthemes)
    library(patchwork)
    
    # data wrangling
    library(dplyr)
    library(tidyr)
    library(readr)
    library(skimr)
    library(janitor)
    library(magrittr)
    library(lubridate)
    
    # modeling
    library(corrplot)
    library(GGally)
    
    # set other options ----
    
    # turn off scientific notation
    options(scipen = 999)
    
    # turn off messages and warnings
    knitr::opts_chunk$set(
      tidy = FALSE, 
      message = FALSE,
      warning = FALSE)
    
    # set the ggplot theme for the document
    theme_set(theme_bw())

    25.1 Machine learning and predictive modeling

    We are going to use machine learning to build and train a predictive model.

    You might recall from your biostats class that typically we draw a sample from a population and then use what we learn from the sample to make inferences about the population it represents, i.e. we can infer the average value of an observation in the (unsampled) population based on the sample2.

  • 2 Recall, for linear regressions we use the formula “on average, for an increase of one unit of the independent variable we expect the dependent variable to increase by the value of the slope” to describe our results.

  • For prediction we also start with a sample; however, in this case the goal is to predict a single observation’s value of a given characteristic based on the characteristics of the observations in the sample. Our goal is not to make a statement about the average expectation for the value of an observation - we want to predict the actual value for that observation.

    In the case at hand, we have our continuous outcome variable that we want to predict (air pollution levels) which we will denote as \(Y\). \(X\) would be our set of predictor variables with \(X_{1}, ... X_{p}\) denoting individual features (development, population density …).

    Our goal is to describe a rule (machine learning algorithm) that takes values for our features as input and then predicts the air pollution (outcome variable) for situations where air pollution is unknown (\(\hat{Y}\)). This process is what we generally describe as training a model and means that we are using a data set where both \(X\) and \(Y\) are known to estimate a function \(\hat{Y} = f(X)\) that uses the predictor variables as input.

    The better our model, the more closely our predicted outcome \(\hat{Y}\) should match our actual outcome \(Y\). In other words, we are trying to minimize the distance between \(\hat{Y} = f(X)\) and \(Y\)3. The distance metric \(d(Y, f(X))\) is frequently chosen to be the absolute or squared difference.

  • 3 This is what is referred to as an optimization problem.

  • In order to set up (and then solve!) a typical ML problem we need four components:

    1. a training data set (i.e. a matching data set of outcome and predictor variables),
    2. a (set of) algorithm(s) to generate candidate functions \(f\),
    3. a distance metric \(d\) to measure how closely \(Y\) and \(\hat{Y}\) match, and
    4. a definition of a “good” (acceptable) distance.
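    As a small illustration of the distance metrics mentioned above (the vectors y and y_hat and their values are made up for this sketch), the absolute and squared differences can be computed directly:

```r
# toy example: observed outcomes and hypothetical model predictions
y     <- c(10.2, 8.5, 12.1, 9.9)   # actual values of the outcome Y
y_hat <- c(9.8, 9.1, 11.5, 10.4)   # predicted values from some f(X)

# mean absolute error: average of |Y - f(X)|
mean(abs(y - y_hat))   # 0.525

# mean squared error: average of (Y - f(X))^2
mean((y - y_hat)^2)    # 0.2825
```

    Which metric you choose changes which errors the optimization punishes most: squaring penalizes large misses more heavily than the absolute difference does.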

    25.2 Data for predictive modeling

    We will use a data set gathered using air pollution monitoring. In this system of monitors, about 90% are located within cities, leaving rural areas severely under-monitored. Our goal is to use this data set to train a model that can accurately predict air pollution levels even where physical monitors are not present. Interest in air pollution stems primarily from the adverse health outcomes related to exposure.

    Before we start, we should consider limitations of the data set we will look at to make sure that we can answer our question and to make sure that we don’t overstep in our interpretation.

    Consider this

    Consider limitations of the data that you should keep in mind when interpreting and discussing your results.

    Did it!

    [Your answer here]

    Here are some factors to consider:

    1. What limitations come with the fact that air pollutants are measured as “Particulate Matter”?
    2. We are measuring (average) outdoor pollution levels; how does this relate to making predictions about individual exposures?
    3. We are using mean estimates of pollution levels.

    We will be using supervised machine learning to predict pollution levels. To do this, we need two main types of data:

    1. A continuous outcome variable - this is the variable we want to predict.
    2. A set of features or predictor variables used to predict the outcome variable.

    In order to build and train a model you need corresponding data sets4. The underlying principle is that if you determine the existing relationship and describe it mathematically using an existing data set, then you will also be able to predict the value of the outcome variable for a new observation, as long as you have values for its predictor variables.

  • 4 Corresponding means they should have the same/very similar spatial and temporal resolution.

  • Consider this

    For our example of creating a model with the goal of predicting pollution levels, what would be our outcome variable and what are potential features (or predictor variables)?

    Did it!

    [Your answer here]

    Consider what we want to predict (What is the outcome of the model? What quantity do we hope to get as a result?) and which parameters might contribute to air pollution levels (i.e. factors that might be causing it to go up or down/are correlated to air pollution).

    With the rise of computational power available at our fingertips, Machine Learning and Artificial Intelligence approaches are increasingly being used to solve problems, especially when large data sets are involved. Unfortunately, this means it quickly turns into a black box: we dump in some outcome and predictor variables, give it a good shake, and take the “answer” we receive at face value.

    To avoid this we need to start with a specific question in mind and carefully consider how our outcome and features are related. Good questions have a plausible explanation for why features predict the outcome and critically evaluate potential for variation in both the features and outcome over time.

    We will be using a data set previously compiled by Roger Peng, a scientist who studies air pollution, climate change, and public health. We will import a data set that combines monitoring data with possible predictor variables compiled from a range of sources.

    The monitoring data comes from gravimetric monitors operated by the EPA that are designed to capture fine particulate matter (PM2.5) using a filtration system. Values are measured either daily or weekly. Our feature data set contains values for each of the 876 monitors (observations) and has been compiled from the EPA, NASA, the US Census, and the National Center for Health Statistics. Most of the features have been taken for a circular area surrounding the monitor itself (a buffer).

    For this module we will explore geographic patterns of air pollution and determine whether we can predict levels of pollution for areas where we do not have monitors based on available information from areas that do have monitors. Specifically, we will answer the central question: Can we predict annual average air pollution concentrations at the resolution of zip code regions using predictor variables describing population density, urbanization, road density, satellite pollution data, and chemical modeling data?

    25.3 Explore the data set

    Let’s import the data set.

    pm <- read_delim("data/air_pollution.csv", delim = ",") %>%
      clean_names()

    Let’s take a quick look at the data types to make sure everything read in correctly.

    pm %>%
      glimpse()
    Rows: 876
    Columns: 50
    $ id                          <dbl> 1003.001, 1027.000, 1033.100, 1049.100, 10…
    $ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
    $ fips                        <dbl> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
    $ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
    $ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
    $ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
    $ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
    $ city                        <chr> "Fairhope", "Ashland", "Muscle Shoals", "C…
    $ cmaq                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
    $ zcta                        <dbl> 36532, 36251, 35660, 35962, 35901, 36303, …
    $ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235,…
    $ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
    $ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
    $ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
    $ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
    $ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
    $ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
    $ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 201266…
    $ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
    $ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
    $ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
    $ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
    $ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
    $ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
    $ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
    $ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
    $ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
    $ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
    $ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
    $ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
    $ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
    $ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
    $ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
    $ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
    $ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
    $ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
    $ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
    $ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
    $ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
    $ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
    $ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
    $ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
    $ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
    $ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
    $ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
    $ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
    $ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
    $ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
    $ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
    $ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…
    Consider this

    Describe your data set, pay specific attention to the following questions to make sure you have oriented yourself:

    • How many monitors do we have in the data set?
    • Which column contains our outcome variable?
    • Which contains our predictor variables?
    Did it!

    [Your answer here]

    • The column value contains PM2.5 concentrations in mass concentration (µg/m3) - this is our outcome variable.
    • The column id codes for the monitors.
    • The remaining columns give us some additional information on the location where the monitor is located - we will have to assess which of those could be good predictor variables.

    Here are the specifics for each column:

    Variable Details
    id Monitor number
    – the county number is indicated before the decimal
    – the monitor number is indicated after the decimal
    Example: 1073.0023 is Jefferson county (1073) and .0023 one of 8 monitors
    fips Federal information processing standard number for the county where the monitor is located
    – 5 digit id code for counties (zero is often the first value and sometimes is not shown)
    – the first 2 numbers indicate the state
    – the last three numbers indicate the county
    Example: Alabama’s state code is 01 because it is first alphabetically
    (note: Alaska and Hawaii are not included because they are not part of the contiguous US)
    lat Latitude of the monitor in degrees
    lon Longitude of the monitor in degrees
    state State where the monitor is located
    county County where the monitor is located
    city City where the monitor is located
    cmaq Estimated values of air pollution from a computational model called Community Multiscale Air Quality (CMAQ)
    – A modeling system that simulates the physics of the atmosphere using chemistry and weather data to predict air pollution
    – Does not use any of the PM2.5 gravimetric monitoring data (there is a version that does use the gravimetric monitoring data, but not this one!)
    – Data from the EPA
    zcta Zip Code Tabulation Area where the monitor is located
    – Postal Zip codes are converted into “generalized areal representations” that are non-overlapping
    – Data from the 2010 Census
    zcta_area Land area of the zip code area in meters squared
    – Data from the 2010 Census
    zcta_pop Population in the zip code area
    – Data from the 2010 Census
    imp_a500 Impervious surface measure
    – Within a circle with a radius of 500 meters around the monitor
    – Impervious surface are roads, concrete, parking lots, buildings
    – This is a measure of development
    imp_a1000 Impervious surface measure
    – Within a circle with a radius of 1000 meters around the monitor
    imp_a5000 Impervious surface measure
    – Within a circle with a radius of 5000 meters around the monitor
    imp_a10000 Impervious surface measure
    – Within a circle with a radius of 10000 meters around the monitor
    imp_a15000 Impervious surface measure
    – Within a circle with a radius of 15000 meters around the monitor
    county_area Land area of the county of the monitor in meters squared
    county_pop Population of the county of the monitor
    log_dist_to_prisec Distance to a primary or secondary road from the monitor (Natural log)
    – Highway or major road
    log_pri_length_5000 Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log)
    – Highways only
    log_pri_length_10000 Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)
    – Highways only
    log_pri_length_15000 Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log)
    – Highways only
    log_pri_length_25000 Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log)
    – Highways only
    log_prisec_length_500 Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (Natural log)
    – Highway and secondary roads
    log_prisec_length_1000 Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (Natural log)
    – Highway and secondary roads
    log_prisec_length_5000 Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log)
    – Highway and secondary roads
    log_prisec_length_10000 Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)
    – Highway and secondary roads
    log_prisec_length_15000 Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log)
    – Highway and secondary roads
    log_prisec_length_25000 Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log)
    – Highway and secondary roads
    log_nei_2008_pm25_sum_10000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
    log_nei_2008_pm25_sum_15000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)
    log_nei_2008_pm25_sum_25000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)
    log_nei_2008_pm10_sum_10000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
    log_nei_2008_pm10_sum_15000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)
    log_nei_2008_pm10_sum_25000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)
    popdens_county Population density (number of people per kilometer squared area of the county)
    popdens_zcta Population density (number of people per kilometer squared area of zcta)
    nohs Percentage of people in the zcta area where the monitor is located that do not have a high school degree
    – Data from the Census
    somehs Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was some high school education
    – Data from the Census
    hs Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing a high school degree
    – Data from the Census
    somecollege Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing some college education
    – Data from the Census
    associate Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing an associate degree
    – Data from the Census
    bachelor Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a bachelor’s degree
    – Data from the Census
    grad Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a graduate degree
    – Data from the Census
    pov Percentage of people in the zcta area where the monitor is located that lived in poverty in 2008 (it is unclear whether the 2007 or 2008 HHS poverty guidelines apply: https://aspe.hhs.gov/2007-hhs-poverty-guidelines)
    – Data from the Census
    hs_orless Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a high school degree or less (sum of nohs, somehs, and hs)
    urc2013 2013 Urban-rural classification of the county where the monitor is located
    – 6 category variable - 1 is totally urban 6 is completely rural
    – Data from the National Center for Health Statistics
    urc2006 2006 Urban-rural classification of the county where the monitor is located
    – 6 category variable - 1 is totally urban 6 is completely rural
    – Data from the National Center for Health Statistics
    aod Aerosol Optical Depth measurement from a NASA satellite
    – based on the diffraction of a laser
    – used as a proxy of particulate pollution
    – unit-less - higher value indicates more pollution
    – Data from NASA
    Consider this

    Take a closer look at the columns id, fips, and zcta, pay specific attention to the following questions:

    • What data type are they?
    • What information do you think they hold?
    • Does the type of data they hold match their assigned data type?
    Did it!

    [Your answer here]

    These are all currently numeric variables, and R thinks they are continuous. However, id is the monitor ID, fips is the federal information processing standard number of the county where the monitor is located, and zcta is the zip code tabulation area.

    These should all be categorical variables, so we need to convert them to factors (R’s data type for categorical variables) using the function as.factor().

    pm <- pm %>%
      mutate(across(c(id, fips, zcta), as.factor))
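    A quick sanity check (a sketch using the pm object and the libraries loaded above) to confirm the conversion worked:

```r
# the three converted columns should now report class "factor"
pm %>%
  select(id, fips, zcta) %>%
  sapply(class)
```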

    Let’s dig a little bit deeper using skim().

    skim(pm)
    Name pm
    Number of rows 876
    Number of columns 50
    _______________________
    Column type frequency:
    character 3
    factor 3
    numeric 44
    ________________________
    Group variables None
    Data summary

    Variable type: character

    skim_variable n_missing complete_rate min max empty n_unique whitespace
    state 0 1 4 20 0 49 0
    county 0 1 3 20 0 471 0
    city 0 1 4 48 0 607 0

    Variable type: factor

    skim_variable n_missing complete_rate ordered n_unique top_counts
    id 0 1 FALSE 876 100: 1, 102: 1, 103: 1, 104: 1
    fips 0 1 FALSE 569 170: 12, 603: 10, 261: 9, 107: 8
    zcta 0 1 FALSE 842 475: 3, 110: 2, 160: 2, 290: 2

    Variable type: numeric

    skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
    value 0 1 10.81 2.58 3.02 9.27 11.15 12.37 23.16 ▂▆▇▁▁
    lat 0 1 38.48 4.62 25.47 35.03 39.30 41.66 48.40 ▁▃▅▇▂
    lon 0 1 -91.74 14.96 -124.18 -99.16 -87.47 -80.69 -68.04 ▃▂▃▇▃
    cmaq 0 1 8.41 2.97 1.63 6.53 8.62 10.24 23.13 ▃▇▃▁▁
    zcta_area 0 1 183173481.91 542598878.48 15459.00 14204601.75 37653560.50 160041508.25 8164820625.00 ▇▁▁▁▁
    zcta_pop 0 1 24227.58 17772.16 0.00 9797.00 22014.00 35004.75 95397.00 ▇▇▃▁▁
    imp_a500 0 1 24.72 19.34 0.00 3.70 25.12 40.22 69.61 ▇▅▆▃▂
    imp_a1000 0 1 24.26 18.02 0.00 5.32 24.53 38.59 67.50 ▇▅▆▃▁
    imp_a5000 0 1 19.93 14.72 0.05 6.79 19.07 30.11 74.60 ▇▆▃▁▁
    imp_a10000 0 1 15.82 13.81 0.09 4.54 12.36 24.17 72.09 ▇▃▂▁▁
    imp_a15000 0 1 13.43 13.12 0.11 3.24 9.67 20.55 71.10 ▇▃▁▁▁
    county_area 0 1 3768701992.12 6212829553.56 33703512.00 1116536297.50 1690826566.50 2878192209.00 51947229509.00 ▇▁▁▁▁
    county_pop 0 1 687298.44 1293488.74 783.00 100948.00 280730.50 743159.00 9818605.00 ▇▁▁▁▁
    log_dist_to_prisec 0 1 6.19 1.41 -1.46 5.43 6.36 7.15 10.45 ▁▁▃▇▁
    log_pri_length_5000 0 1 9.82 1.08 8.52 8.52 10.05 10.73 12.05 ▇▂▆▅▂
    log_pri_length_10000 0 1 10.92 1.13 9.21 9.80 11.17 11.83 13.02 ▇▂▇▇▃
    log_pri_length_15000 0 1 11.50 1.15 9.62 10.87 11.72 12.40 13.59 ▆▂▇▇▃
    log_pri_length_25000 0 1 12.24 1.10 10.13 11.69 12.46 13.05 14.36 ▅▃▇▇▃
    log_prisec_length_500 0 1 6.99 0.95 6.21 6.21 6.21 7.82 9.40 ▇▁▂▂▁
    log_prisec_length_1000 0 1 8.56 0.79 7.60 7.60 8.66 9.20 10.47 ▇▅▆▃▁
    log_prisec_length_5000 0 1 11.28 0.78 8.52 10.91 11.42 11.83 12.78 ▁▁▃▇▃
    log_prisec_length_10000 0 1 12.41 0.73 9.21 11.99 12.53 12.94 13.85 ▁▁▃▇▅
    log_prisec_length_15000 0 1 13.03 0.72 9.62 12.59 13.13 13.57 14.41 ▁▁▃▇▅
    log_prisec_length_25000 0 1 13.82 0.70 10.13 13.38 13.92 14.35 15.23 ▁▁▃▇▆
    log_nei_2008_pm25_sum_10000 0 1 3.97 2.35 0.00 2.15 4.29 5.69 9.12 ▆▅▇▆▂
    log_nei_2008_pm25_sum_15000 0 1 4.72 2.25 0.00 3.47 5.00 6.35 9.42 ▃▃▇▇▂
    log_nei_2008_pm25_sum_25000 0 1 5.67 2.11 0.00 4.66 5.91 7.28 9.65 ▂▂▇▇▃
    log_nei_2008_pm10_sum_10000 0 1 4.35 2.32 0.00 2.69 4.62 6.07 9.34 ▅▅▇▇▂
    log_nei_2008_pm10_sum_15000 0 1 5.10 2.18 0.00 3.87 5.39 6.72 9.71 ▂▃▇▇▂
    log_nei_2008_pm10_sum_25000 0 1 6.07 2.01 0.00 5.10 6.37 7.52 9.88 ▁▂▆▇▃
    popdens_county 0 1 551.76 1711.51 0.26 40.77 156.67 510.81 26821.91 ▇▁▁▁▁
    popdens_zcta 0 1 1279.66 2757.49 0.00 101.15 610.35 1382.52 30418.84 ▇▁▁▁▁
    nohs 0 1 6.99 7.21 0.00 2.70 5.10 8.80 100.00 ▇▁▁▁▁
    somehs 0 1 10.17 6.20 0.00 5.90 9.40 13.90 72.20 ▇▂▁▁▁
    hs 0 1 30.32 11.40 0.00 23.80 30.75 36.10 100.00 ▂▇▂▁▁
    somecollege 0 1 21.58 8.60 0.00 17.50 21.30 24.70 100.00 ▆▇▁▁▁
    associate 0 1 7.13 4.01 0.00 4.90 7.10 8.80 71.40 ▇▁▁▁▁
    bachelor 0 1 14.90 9.71 0.00 8.80 12.95 19.22 100.00 ▇▂▁▁▁
    grad 0 1 8.91 8.65 0.00 3.90 6.70 11.00 100.00 ▇▁▁▁▁
    pov 0 1 14.95 11.33 0.00 6.50 12.10 21.22 65.90 ▇▅▂▁▁
    hs_orless 0 1 47.48 16.75 0.00 37.92 48.65 59.10 100.00 ▁▃▇▃▁
    urc2013 0 1 2.92 1.52 1.00 2.00 3.00 4.00 6.00 ▇▅▃▂▁
    urc2006 0 1 2.97 1.52 1.00 2.00 3.00 4.00 6.00 ▇▅▃▂▁
    aod 0 1 43.70 19.56 5.00 31.66 40.17 49.67 143.00 ▃▇▁▁▁

    Data summary of the feature data set for air pollution monitoring.

    Consider this

    Here are a few starting points to explore your data set more in depth:

    • How much missing data is there?
    • What data types are represented?
    • How many states are represented?
    • Are there variables with large ranges? small ranges?
    • Are there variables with normal distributions? very uneven distributions?

    For each question, also explain how you figured out the answer.

    Did it!

    [Your answer here]
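    If you are unsure where to start, here are a few sketches (assuming the pm object and the libraries loaded above) that pull some of these answers out of the data programmatically; the skim() output above reports the same information:

```r
# total number of missing values across the whole data set
sum(is.na(pm))

# number of distinct states represented
n_distinct(pm$state)

# compare variable ranges: county_area spans orders of magnitude,
# while urc2013 only takes values between 1 and 6
range(pm$county_area)
range(pm$urc2013)
```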

    The next step is to look a little deeper at what specific information is in each column to make sure we are considering confounding variables or biases in the data set.

    Consider this

    Here are a few starting points to explore:

    • What states are included?
    • What cities have the highest number of air monitors?

    For each question include the code you used to get the answer:

    Did it!

    [Your answer here]

    What states are included?

    state
    Alabama
    Arizona
    Arkansas
    California
    Colorado
    Connecticut
    Delaware
    District Of Columbia
    Florida
    Georgia
    Idaho
    Illinois
    Indiana
    Iowa
    Kansas
    Kentucky
    Louisiana
    Maine
    Maryland
    Massachusetts
    Michigan
    Minnesota
    Mississippi
    Missouri
    Montana
    Nebraska
    Nevada
    New Hampshire
    New Jersey
    New Mexico
    New York
    North Carolina
    North Dakota
    Ohio
    Oklahoma
    Oregon
    Pennsylvania
    Rhode Island
    South Carolina
    South Dakota
    Tennessee
    Texas
    Utah
    Vermont
    Virginia
    Washington
    West Virginia
    Wisconsin
    Wyoming
    Table 25.1: States with available data.

    Which cities have the largest number of air monitors?

    city n
    Not in a city 103
    New York 9
    Cleveland 6
    Baltimore 5
    Chicago 5
    Detroit 5
    Milwaukee 5
    New Haven 5
    Philadelphia 5
    Springfield 5
    Boston 4
    Columbus 4
    Indianapolis (Remainder) 4
    Louisville 4
    St. Louis 4
    Washington 4
    Table 25.2: Number of air monitors located in each city compared to rural areas.
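    One way tables like these might have been generated (a sketch using dplyr; your own approach may differ):

```r
# list the distinct states represented in the data set
pm %>%
  distinct(state) %>%
  arrange(state)

# count monitors per city, largest counts first
pm %>%
  count(city, sort = TRUE) %>%
  head(16)
```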

    We should also evaluate to what extent our predictor variables are correlated.

    Consider this

    Consider why it could be problematic if we had highly correlated predictor variables in our feature set.

    Did it!

    [Your answer here]

    When we use linear regressions and predictor variables are correlated they end up predicting each other rather than the outcome variable. This is generally referred to as multicollinearity.

    Additionally, including redundant variables can add unnecessary noise to our model, which can end up decreasing the accuracy of our predictions and slowing down our model.

    Correlation among predictor variables can also make it difficult to understand which of them are actually predictive.
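    One way to flag problematic pairs ahead of time (a sketch using the pm object from above; the 0.9 cutoff is an arbitrary choice for illustration, not a hard rule):

```r
# absolute pairwise correlations among the numeric variables
cor_mat <- pm %>%
  select(where(is.numeric)) %>%
  cor() %>%
  abs()

# keep each pair only once by blanking the diagonal and lower triangle
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- 0

# report the variable pairs with |r| above the threshold
which(cor_mat > 0.9, arr.ind = TRUE)
```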

    We’ll start by making pairwise comparisons among all of our numeric (continuous) variables using corrplot.

    # calculate Pearson's coefficients
    PM_cor <- pm %>%
      select_if(is.numeric) %>%
      cor()
      
    # plot pairwise correlation matrix
    corrplot(PM_cor, tl.cex = 0.5)

    Figure 25.1: Pairwise comparison of correlation coefficients for all continuous variables.

    Let’s re-plot that visualization using the absolute values of the Pearson correlation coefficients5 and organizing our variables using hierarchical clustering, i.e. variables that are more similar to each other will be placed closer together.

  • 5 We are not interested in the direction of the correlation, only in how strong it is.

  • corrplot(abs(PM_cor), 
             order = "hclust", 
             tl.cex = 0.5, 
             col.lim = c(0, 1))

    Figure 25.2: Pairwise comparison of correlation coefficients for all continuous variables. Size and color intensity of circles represent the strength of these relationships.

    We can now see that there are quite a few variables correlated with our outcome variable (value). Unsurprisingly, we also see correlated sets of variables describing characteristics of the area surrounding each monitor, primarily the variables containing imp_*, *pri_*, and *nei_*.
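    Assuming the PM_cor matrix computed above, the correlations of each variable with the outcome can be pulled out and ranked directly:

```r
# correlations with the outcome, strongest first
# (the first entry is value itself, with r = 1)
sort(abs(PM_cor["value", ]), decreasing = TRUE) %>%
  head(10)
```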

    Consider this

    Look up what these variables are and argue whether or not it makes sense for these sets of values to be correlated to the PM2.5 value. What other variables might be good indicators?

    Did it!

    [Your answer here]

    We can consider the set of imp variables as indicative of development/urbanization, the pri variables describe road density, which can be an indicator of the number of cars, and the nei variables describe emissions.

    We can take a closer look at sets of variables we are interested in using the ggcorr() and ggpairs() functions from the GGally package.

    Let’s start with our variables measuring how impervious the surface is:

    # plot pairwise correlation matrix; ggcorr() comes from the GGally package
    library(GGally)

    pm %>%
      select(contains("imp")) %>%
      ggcorr(label = TRUE)

    Figure 25.3: Pairwise comparison of correlation coefficients for all variables related to levels of development, i.e. level of surface imperviousness.

    Now let’s take an even closer look at these variables, visualizing their pairwise relationships using scatter plots and the distribution of values for each variable using density plots, along with the correlation coefficients.

    # plot scatter plots, density plots & correlation coefficients
    pm %>%
      select(contains("imp")) %>%
      ggpairs() 

    Figure 25.4: Scatterplots describing the relationships between all combinations of values describing level of development for buffers surrounding each air pollution monitor (below the diagonal) and corresponding Pearson’s correlation coefficients (above the diagonal), the distribution of values for each variable is on the diagonal.
    Consider this

    Think through the results of your exploration of existing correlations:

    • Which of these pairwise comparisons have the highest level of correlation?
    • Which have the lowest?
    • Do these patterns make sense?
    Did it!

    [Your answer here]

    Next, let’s take a peek at our road density data:

    pm %>%
      select(contains("pri")) %>%
      ggpairs()

    Figure 25.5: Scatterplots describing the relationships between all combinations of values describing road density for buffers surrounding each air pollution monitor (below the diagonal) and corresponding Pearson’s correlation coefficients (above the diagonal), the distribution of values for each variable is on the diagonal.
    Consider this

    Think through the results of your exploration of existing correlations:

    • Which of these pairwise comparisons have the highest level of correlation?
    • Which have the lowest?
    • Do these patterns make sense?
    Did it!

    [Your answer here]

    Finally, let’s look at the emissions variables:

    pm %>%
      select(contains("nei")) %>%
      ggpairs()

    Figure 25.6: Scatterplots describing the relationships between all combinations of variables measuring air pollution for buffers surrounding each air pollution monitor (below the diagonal) and corresponding Pearson’s correlation coefficients (above the diagonal), the distribution of values for each variable is on the diagonal.
    Consider this

    Think through the results of your exploration of existing correlations:

    • Which of these pairwise comparisons have the highest level of correlation?
    • Which have the lowest?
    • Do these patterns make sense?
    Did it!

    [Your answer here]

    Since we now know that the variables within each of these categories are correlated with each other, let’s select one variable from each category for the same buffer size and compare how they are correlated with each other and with population density.

    pm %>%
      select(log_nei_2008_pm25_sum_10000, popdens_county, 
             log_pri_length_10000, imp_a10000, county_pop) %>%
      ggpairs()

    Figure 25.7: Scatterplots describing the relationships between range of variables describing development, road density, emission levels, and population density for buffers surrounding each air pollution monitor (below the diagonal) and corresponding Pearson’s correlation coefficients (above the diagonal), the distribution of values for each variable is on the diagonal.
    Consider this

    Think through the results of your exploration of existing correlations:

    • Which of these pairwise comparisons have the highest level of correlation?
    • Which have the lowest?
    • Do these patterns make sense?
    Did it!

    [Your answer here]

    Some of our variables have extreme variability; one way to handle this is to apply a log transformation. That transformation might affect our correlations, so we should take a quick look.

    pm %>%
      mutate(log_popdens_county = log(popdens_county)) %>%
      mutate(log_pop_county = log(county_pop)) %>%
      select(log_nei_2008_pm25_sum_10000, log_popdens_county, 
           log_pri_length_10000, imp_a10000, log_pop_county) %>%
      ggpairs()

    Figure 25.8: Scatterplots describing the relationships between range of variables describing development, road density, emission levels, and population density for buffers surrounding each air pollution monitor (below the diagonal) and corresponding Pearson’s correlation coefficients (above the diagonal), the distribution of values for each variable is on the diagonal.
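    Why does a log transformation change the correlations at all? Pearson’s r measures linear association, so log-transforming a skewed variable can strengthen (or weaken) the coefficient substantially. A small deterministic sketch with made-up values:

    ```r
    x <- 1:100
    y <- exp(x / 20)   # y grows exponentially in x, so the relationship is curved

    cor(x, y)          # noticeably below 1: a straight line fits poorly
    cor(x, log(y))     # essentially 1: log(y) is a perfectly linear function of x
    ```

    This is why we recompute the pairwise correlations after transforming popdens_county and county_pop rather than assuming the earlier values still apply.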

    This does increase our correlation levels a bit, but overall these variables do not appear to be highly correlated, so we should make sure to keep a variable from each category to use as predictors for our model.

    Finally, let’s add an additional categorical variable that tells us whether or not a monitor is in a city - this will be helpful down the line.

    pm <- pm %>%
      mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                              city != "Not in a city" ~ "In a city"))
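    The `case_when()` above collapses the many distinct city names into just two levels. Since the condition is binary, a base-R equivalent using `ifelse()` does the same thing; here is a sketch on a toy vector (the city names are hypothetical, not values from `pm`):

    ```r
    # toy vector standing in for the city column
    city <- c("Albany", "Not in a city", "Boise", "Not in a city")

    # anything that is not "Not in a city" is collapsed into the "In a city" level
    city2 <- ifelse(city == "Not in a city", "Not in a city", "In a city")
    city2
    ```

    Either way, the result is a two-level categorical variable, which will be much easier to use as a predictor than hundreds of individual city names.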